I began my PhD studies at the University of Helsinki this autumn, and I expect to use R as part of my research project in musicology, which takes a digital humanities approach. I have previously taken courses in statistics (including analysis in R) through the open universities of the University of Helsinki and the Hanken School of Economics. I recently got a new computer (and unfortunately lost most of my scripts from earlier courses), so I am now making a fresh start with R.
Until now, I have most often processed my numbers in spreadsheets such as Excel and Google Sheets, and I have developed techniques for doing my calculations quickly in these programs. The downside is that the threshold for using a spreadsheet has usually felt lower than that for using R when processing numbers (and the quickest calculator is still Spotlight on macOS). During this course, I hope to learn some useful techniques for making sense of data quickly. Perhaps RStudio might become my program of choice some day. :)
The address of my GitHub repository is http://github.com/wilhelmkvist/IODS-project.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 13 01:55:10 2021"
date()
## [1] "Mon Dec 13 01:55:10 2021"
The dataset ‘Learning 2014’ contains data from a survey among students in the social sciences at University of Helsinki. The data was collected during an introductory course in statistics in late 2014 and early 2015. Participation was voluntary, but students were strongly encouraged to take part as they were reimbursed with extra points in the final exam. Respondents (N=183) were presented with a set of questions designed to reflect their attitudes towards statistics and university studies in general. Learning approaches were measured with questions designed to reflect deep, surface, and strategic approaches. The present dataset (a subset of a larger dataset) contains 166 observations (by all those who participated in the final exam) of 7 variables. In addition to mean scores reflecting attitude and learning approaches, the total points in the final exam are provided. Background information was provided on the respondents’ age and gender (female/male).
setwd("~/Documents/IODS-project")
learning2014 <-
read.table("data/learning2014.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Below is an overview of the variables. The course was clearly dominated by female students, who account for two out of three respondents. Note the age span from 17 to 55 years (mean 25.5 years, median 22 years).
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The chart below outlines attitude, learning approaches and final exam score (points) by gender. Males generally scored higher on attitude and the deep learning approach, while females scored higher on the strategic and surface learning approaches. The gender divide in the final exam was about equal, with only a few males outperforming females at the very highest scores. Although males were generally slightly older, age and final score were negatively (not positively) correlated for males (-0.24); for females the correlation was negligible (-0.02). The strongest correlation was between attitude and final score (0.44), about equal for males and females. There was a slight correlation between attitude and the deep learning approach (0.11), more so among males (0.17). However, there was no correlation between the deep learning approach and final scores (-0.01). Apart from attitude, a strategic learning approach was positively correlated with points (0.15), while a surface learning approach was negatively correlated with final scores (-0.14).
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
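The correlations cited above can also be computed directly with cor(), outside the ggpairs grid. A minimal sketch on a hypothetical mini data frame (with the real data, df would simply be learning2014):

```r
# hypothetical mini data frame mimicking the structure of learning2014
df <- data.frame(
  gender   = factor(c("F", "M", "F", "M", "F", "M")),
  attitude = c(3.7, 3.1, 2.5, 3.5, 3.8, 2.9),
  points   = c(25, 12, 24, 10, 21, 31)
)
# overall correlation between attitude and points
cor(df$attitude, df$points)
# the same correlation computed separately by gender
sapply(split(df, df$gender), function(g) cor(g$attitude, g$points))
```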
Based on the correlations between variables (analyzed above), I fit a regression model in order to understand the impact of age, attitude and learning approach on exam points. As it turns out, the student’s attitude is highly significant (p < 0.001), while age and a strategic learning approach are only moderately significant (p = 0.0981 and 0.0621, respectively).
options(scipen = 999) #to indicate the minimal p-values with zeros
model1 <- lm(points ~ age + attitude + stra, data = learning2014)
summary(model1)
##
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 0.00006171726 ***
## age -0.08822 0.05302 -1.664 0.0981 .
## attitude 3.48077 0.56220 6.191 0.00000000472 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 0.0000000107
For the sake of comparison, I fit a second model. Excluding the moderately significant variables ‘age’ and ‘stra’ does not seem to make any big difference: the intercept and attitude remain highly significant in both models (though I have not conducted a formal hypothesis test between the models).
model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 0.00000000195 ***
## attitude 3.5255 0.5674 6.214 0.00000000412 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 0.000000004119
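For the record, nested models such as model1 and model2 could be compared formally with an F-test via anova(). A sketch using the built-in mtcars data as a stand-in (wt, hp and qsec are just placeholders for the roles of attitude, age and stra):

```r
# F-test between nested linear models, sketched on the built-in mtcars data
small <- lm(mpg ~ wt, data = mtcars)              # analogue of model2
large <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # analogue of model1
anova(small, large)  # a small p-value would favour keeping the extra terms
```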
I choose to go forward with the first model (model1). According to the model, exam scores can be predicted if we know a person’s age, attitude and score on the strategic learning approach. As stated earlier, age is slightly negatively correlated with exam score, while attitude is positively correlated with exam score to a much larger degree. A strategic learning approach is moderately positively correlated with exam scores.
Let us make two example calculations: one for a moderately motivated young undergraduate (“Mike”) and another for a highly motivated late bloomer (“Ritva”). Mike, age 22, scores 2 out of 5 on both attitude and strategic learning. Ritva, age 55, scores 5 out of 5 on both. We can now predict their exam scores:
students <- c('Mike','Ritva')
age <- c(22, 55)
attitude <- c(2,5)
stra <- c(2,5)
new_data <- data.frame(students, age, attitude, stra)
predict(model1, newdata = new_data)
## 1 2
## 17.92358 28.46580
Effectively, R here makes the following calculations using these coefficients:
model1$coefficients
## (Intercept) age attitude stra
## 10.89542805 -0.08821869 3.48076791 1.00371238
Calculating predictions using the coefficients can be done in the following manner. (Differences in decimals are due to rounding errors.)
#Mike:
10.89543+(-0.08822*22)+(3.48077*2)+(1.00371*2)
## [1] 17.92355
#Ritva:
10.89543+(-0.08822*55)+(3.48077*5)+(1.00371*5)
## [1] 28.46573
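The same arithmetic can also be written as a dot product between the coefficient vector and a row of predictor values (with a 1 for the intercept term); the coefficients below are copied from the model1 printout above:

```r
# coefficients of model1, copied from the printout above
b <- c(10.89542805, -0.08821869, 3.48076791, 1.00371238)
# Mike: intercept term, age 22, attitude 2, stra 2
mike <- c(1, 22, 2, 2)
sum(b * mike)  # ~ 17.92, matching predict() above
```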
The R-squared value (0.2182) summarizes how well the regression line fits the data. Using it, we see that the model makes a fairly good fit, explaining about 22 per cent of the sample variance. (Put simply, roughly a fifth of the variation in exam scores is explained by the model’s three predictors.)
summary(model1)$r.squared
## [1] 0.218172
In the following, I will produce three diagnostic plots in order to graphically explore the validity of my model assumptions. Analyzing residuals vs fitted values checks the linearity and constant-variance assumptions; the QQ plot checks whether the errors are normally distributed; residuals vs leverage reveals whether any single observation has an outsized influence on the model. In this case, the errors seem to be approximately normally distributed, uncorrelated, and of constant variance (sigma^2). No severe outliers are identified.
par(mfrow = c(2,2))
plot(model1, which = c(1,2,5))
date()
## [1] "Mon Dec 13 01:45:46 2021"
The Alc dataset provides data on alcohol consumption among Portuguese secondary-school students aged 15 to 22 (mean 16.6 years). The data was collected using school reports and questionnaires. Attributes include student grades and demographic, social and school-related features. The Alc dataset was constructed by joining two separate sets on student performance in two distinct subjects, Mathematics and Portuguese. As some students were known to appear in both sets, duplicates were identified and removed.
The Alc dataset features a total of 370 observations of 33 variables (listed below). Note: alc_use was calculated as the mean of workday alcohol consumption (Dalc) and weekend alcohol consumption (Walc); high_use is a binary variable indicating whether the student drinks more than 2 doses per day on average. The data was used in a paper by Cortez and Silva (2008). More information on the dataset, along with attribute descriptions, can be found on the UCI Machine Learning Repository website.
alc <- read.table("data/alc.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
There are a number of variables that can be studied to understand student alcohol consumption. To get a quick overview of the content and distribution of each variable we can use the following code:
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
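A side note: gather() is superseded in newer versions of tidyr, where pivot_longer() performs the same reshaping. A minimal sketch on a hypothetical two-column frame:

```r
library(tidyr)
# hypothetical frame with two measure columns
d <- data.frame(a = 1:3, b = 4:6)
# superseded form: gather(d)
# current equivalent:
long <- pivot_longer(d, cols = c(a, b), names_to = "key", values_to = "value")
long
```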
In order to find out which variables could possibly be statistically relevant, I begin by running a regression model explaining “alc_use” with all other variables (except for “Dalc”, “Walc” and “high_use” which are directly related to “alc_use”).
# exclude variables "Dalc","Walc","high_use" which are directly related to "alc_use"
selvarnames <- names(alc) %in% c("Dalc","Walc","high_use")
alc2 <- alc[!selvarnames]
# create a formula where "alc_use" is explained by the selected variables (stored in alc2)
fmla <- as.formula(paste("alc_use~",paste(names(alc2)[1:31],collapse="+")))
# create linear regression model and read the summary
model1 <- lm(data=alc2, formula = fmla)
summary(model1)
##
## Call:
## lm(formula = fmla, data = alc2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65003 -0.49853 -0.08492 0.38197 2.87986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.308078 0.928209 -0.332 0.740172
## schoolMS -0.206366 0.166819 -1.237 0.216943
## sexM 0.453517 0.098064 4.625 0.0000053993241 ***
## age 0.067109 0.043688 1.536 0.125478
## addressU -0.232701 0.119806 -1.942 0.052951 .
## famsizeLE3 0.159470 0.099574 1.602 0.110221
## PstatusT 0.016944 0.148389 0.114 0.909161
## Medu 0.021611 0.065772 0.329 0.742688
## Fedu 0.055074 0.056854 0.969 0.333409
## Mjobhealth -0.252106 0.227986 -1.106 0.269622
## Mjobother -0.111991 0.146597 -0.764 0.445450
## Mjobservices -0.135048 0.166597 -0.811 0.418166
## Mjobteacher -0.066798 0.210155 -0.318 0.750799
## Fjobhealth 0.066192 0.305988 0.216 0.828870
## Fjobother 0.179164 0.224244 0.799 0.424887
## Fjobservices 0.399310 0.231379 1.726 0.085325 .
## Fjobteacher -0.115857 0.271994 -0.426 0.670421
## reasonhome 0.045740 0.113435 0.403 0.687044
## reasonother 0.409977 0.164654 2.490 0.013270 *
## reasonreputation 0.056384 0.117446 0.480 0.631488
## guardianmother -0.149224 0.108597 -1.374 0.170344
## guardianother -0.232964 0.235168 -0.991 0.322593
## traveltime 0.152445 0.069143 2.205 0.028161 *
## studytime -0.115010 0.058181 -1.977 0.048903 *
## schoolsupyes 0.021980 0.140610 0.156 0.875877
## famsupyes -0.057759 0.097938 -0.590 0.555763
## activitiesyes -0.185204 0.090385 -2.049 0.041249 *
## nurseryyes -0.261139 0.112122 -2.329 0.020461 *
## higheryes 0.237373 0.241308 0.984 0.325989
## internetyes 0.042568 0.131414 0.324 0.746201
## romanticyes 0.052334 0.097849 0.535 0.593119
## famrel -0.178569 0.048777 -3.661 0.000293 ***
## freetime 0.069539 0.048266 1.441 0.150606
## goout 0.293976 0.042174 6.971 0.0000000000173 ***
## health 0.056055 0.032523 1.724 0.085729 .
## failures 0.169898 0.089496 1.898 0.058521 .
## paidyes 0.287764 0.095692 3.007 0.002840 **
## absences 0.028011 0.008503 3.294 0.001094 **
## G1 -0.057193 0.037889 -1.509 0.132136
## G2 0.042812 0.047726 0.897 0.370356
## G3 0.006646 0.034650 0.192 0.848011
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8055 on 329 degrees of freedom
## Multiple R-squared: 0.4191, Adjusted R-squared: 0.3485
## F-statistic: 5.935 on 40 and 329 DF, p-value: < 0.00000000000000022
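As an aside, the paste()-based formula construction used above could also be written with reformulate(); a minimal sketch with hypothetical predictor names:

```r
# reformulate() builds the same kind of formula without paste()
# (hypothetical predictor names, for illustration only)
f <- reformulate(c("age", "goout", "studytime"), response = "alc_use")
f  # alc_use ~ age + goout + studytime
```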
From the regression model summary, I can read that male sex, the quality of family relationships and going out with friends are highly significant variables. The variables “paid” (extra paid classes in Maths or Portuguese) and “absences” (number of school absences) are fairly significant. Moderately significant variables include study time, travel time, whether the student attended nursery school, and whether the student has taken part in extracurricular activities.
In order to understand the relationship between high alcohol consumption and selected variables, I choose to visualize the relationship between high alcohol consumption and sex/gender, family relationships, going out and study time. For the analysis, I construct a subset comprising columns 2 (sex), 22 (family relationship), 24 (going out), 14 (study time) and 35 (high_use). The visualization is presented in sex-disaggregated form.
library(ggplot2)
library(GGally)
#construct subset comprising columns 2 (sex), 22 (family relationship), 24 (going out), 14 (study time) and 35 (high_use)
alcpairs <- alc[, c(2, 22,24,14,35)]
#plot the pairs
ggpairs(alcpairs, mapping = aes(col=sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
From the visual presentation, I can formulate the following working hypotheses for the selected variables: men are more likely than women to be high consumers of alcohol; students with high alcohol consumption score lower on the quality of family relationships; high alcohol consumption is related to the frequency of going out; and lower study times are related to higher alcohol consumption.
The table below shows a summary of key statistics related to heavy and moderate drinkers (high_use = True/False) looking at four variables (three numeric and sex/gender).
alc %>% group_by(sex, high_use) %>% summarise(count = n(), famrel=mean(famrel), goout=mean(goout), studytime=mean(studytime))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 6
## # Groups: sex [2]
## sex high_use count famrel goout studytime
## <fct> <lgl> <int> <dbl> <dbl> <dbl>
## 1 F FALSE 154 3.92 2.95 2.34
## 2 F TRUE 41 3.73 3.39 2
## 3 M FALSE 105 4.13 2.70 1.90
## 4 M TRUE 70 3.79 3.93 1.63
The table above supports what has been stated previously: Portuguese young heavy drinkers generally go out more than their moderately drinking peers. They also spend less time studying, and they rate their family relations as worse.
To indicate the relationship between family relations and heavy drinking, let’s draw a separate box plot. As the chart indicates, moderate consumers of alcohol have better family ties. The difference between groups is not, however, as big as the boxplot visualization would initially suggest. Therefore, red dots have been added to indicate mean values by group and sex.
The boxplot visualization indicates that students with high alcohol consumption score lower on the quality of family relationships, but the difference in group means is much smaller than the integer-based visualization suggests: only 0.34 score points for males and 0.19 for females. In this respect, the boxplot visualization is easily misleading.
NB! 1) In both groups there are outliers scoring very low on family relations. In general, however, respondents have scored fairly high (median = 4). NB! 2) The figure says little about whether the quality of family relations is the cause or the consequence of high alcohol consumption.
# initialise a plot of high_use and famrel
g <- ggplot(alc, aes(x = high_use, y = famrel, col = sex))
g + geom_boxplot() + ylab("family relation") + ggtitle("Quality of family relationships \n by alcohol consumption and sex") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun=mean)
## Warning: Removed 4 rows containing missing values (geom_segment).
We can justify the claim of a misleading visualization by adjusting the code for the previous table to also report median values. Doing so shows that there is no difference between groups and sexes in the median of family relations. Reporting median values also reveals a difference in the “goout” variable, with heavy drinkers more inclined towards going out. (The median study time, not shown in this version of the table, remains the same.)
alc %>% group_by(sex, high_use) %>% summarise(count = n(), famrel_mean=mean(famrel), famrel_median=median(famrel), goout_mean=mean(goout), goout_median=median(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 7
## # Groups: sex [2]
## sex high_use count famrel_mean famrel_median goout_mean goout_median
## <fct> <lgl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 F FALSE 154 3.92 4 2.95 3
## 2 F TRUE 41 3.73 4 3.39 4
## 3 M FALSE 105 4.13 4 2.70 3
## 4 M TRUE 70 3.79 4 3.93 4
In the logistic regression model, the computational target variable is the log of odds; applying the exponential function to the fitted values therefore gives the odds. That is, the exponents of the coefficients can be interpreted as odds ratios between a unit change (vs no change) in the corresponding explanatory variable (according to the course video on DataCamp).
From this we see that, with an odds ratio above 2, males are more than twice as likely to have a “success” (that is, to be heavy drinkers) compared to their female peers when controlling for family relations, going out and study time. The same goes for students spending much time going out. On the other hand, students with good family relations or who spend much time studying are less likely to be heavy drinkers; their odds ratios are only about 2/3 compared with their average peers.
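As a quick check, the odds ratios quoted here follow directly from exponentiating the coefficients reported in the model summary further down:

```r
# odds ratios recovered by hand from two coefficients in the model summary
exp(0.7959)   # sexM: ~ 2.22
exp(-0.4193)  # famrel: ~ 0.66
```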
The confidence intervals in the two rightmost columns (2.5 % and 97.5 %) of the table below give an indication of the spread. For instance, the interval for sex is wider than the interval for going out: although the two odds ratios are about the same, the estimate for males is less precise than the estimate for frequent outgoers.
The data presented here largely supports my initial working hypotheses about men being more likely than women to be high consumers, about students with high alcohol consumption generally scoring lower on the quality of family relationships, about high alcohol consumption being related to the frequency of going out and about lower study times being related to higher alcohol consumption.
However, revisiting the hypotheses reveals a somewhat erroneous wording and approach, focusing too much on the faults of those I have called heavy drinkers (those with a positive high_use value). It would perhaps be fairer to comment on the relationship between high daily alcohol doses and background factors, and to avoid judgmental attitudes.
Below is the printout of the summary of the logistic regression model and the table with odds ratios and confidence intervals.
m <- glm(high_use ~ sex + famrel + goout + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + famrel + goout + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5891 -0.7629 -0.4976 0.8304 2.6937
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2672 0.7312 -1.733 0.08309 .
## sexM 0.7959 0.2669 2.982 0.00286 **
## famrel -0.4193 0.1399 -2.996 0.00273 **
## goout 0.7873 0.1230 6.399 0.000000000157 ***
## studytime -0.4814 0.1711 -2.814 0.00490 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 370.69 on 365 degrees of freedom
## AIC: 380.69
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2816141 0.06545486 1.1622100
## sexM 2.2165443 1.31841735 3.7630180
## famrel 0.6575181 0.49791884 0.8636137
## goout 2.1974324 1.73873662 2.8198312
## studytime 0.6179355 0.43751933 0.8576353
Due to the method chosen at the start of this exercise, all of the variables controlled for were found to be statistically significant: one at the 0.1 per cent level and three at the 1 per cent level. I can therefore explore the predictive power of the model as such.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to "alc"
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions. The numbers are the count of individuals.
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 230 29
## TRUE 59 52
# tabulate the target variable versus the predictions. The table shows the proportions.
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.62162162 0.07837838 0.70000000
## TRUE 0.15945946 0.14054054 0.30000000
## Sum 0.78108108 0.21891892 1.00000000
Based on the data and the reported alcohol intake, exactly 30 per cent of respondents (111 of 370) were classified as heavy drinkers (high_use = TRUE), and 70 per cent (259 individuals) as non-heavy drinkers (high_use = FALSE). According to the model, 29 of the 259 moderately drinking individuals (11 per cent) were erroneously predicted to be heavy drinkers. Of the 111 heavy drinkers, a majority (59) were erroneously predicted not to be high users, although they were according to the data.
Although the model did fairly well in recognizing non-heavy users, it did worse at identifying actual heavy drinkers (based on the reported intake). Overall, the model predicted 78 per cent of respondents to be non-heavy users, while the actual proportion was 70 per cent.
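These rates can be checked by hand from the counts in the first table; a sketch with the counts typed in from the output above:

```r
# confusion-matrix counts copied from the table above
tn <- 230; fp <- 29; fn <- 59; tp <- 52
accuracy    <- (tn + tp) / (tn + fp + fn + tp)  # share classified correctly
sensitivity <- tp / (tp + fn)                   # heavy drinkers correctly identified
specificity <- tn / (tn + fp)                   # non-heavy drinkers correctly identified
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

The accuracy (about 76 per cent) is exactly the complement of the 24 per cent training error computed with the loss function.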
The proportion of wrongly classified individuals is displayed visually in the plot below. As the visualization makes clear, the proportion of inaccurately classified individuals (i.e. the training error) is fairly high. The mean prediction error can be computed by defining a loss function and comparing classifications and probabilities. The calculation indicates that nearly 24 per cent are wrongly classified. I would not have expected the analysis to give such a high number. On the other hand, the model could become more accurate with a higher number of observations.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378
At a loss rate of 24 per cent, the model does not seem very accurate. Can I use the information? It depends on the purpose. For instance, if the goal were to target alcohol drinkers with advertisements on social media (assuming that heavy drinkers would be more inclined to buy booze), the information could definitely be useful (though in this case one would be targeting alcohol ads at minors…). For identifying who will become an alcoholic in five years and who needs intervention, the model is far too inaccurate. At this level of accuracy, one could perhaps only target information campaigns on the potential harms of drinking at an early age.
With ten-fold cross-validation, the prediction error seems to be larger than when using the loss function (with a difference of about one percentage point).
# Performing ten-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2405405
# calculating the difference between using ten-fold cross-validation and a loss function
cv$delta[1]-loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.002702703
Finally, I will perform cross-validation to compare the performance of logistic regression models using different sets of predictors. I start with the maximum number of predictors and explore the changes in the training and testing errors as I drop variables one by one. The original dataset contained 33 variables, but since high_use is directly derived from Walc and Dalc, I choose to exclude these two for computational feasibility.
In practice, I begin by creating a vector of running numbers in decreasing order from 31 to 1, indicating how many variables I will test at a time. I then create two more empty vectors, where I will save the results of the computational exercise. The three vectors will be used to construct the resulting data frame.
I then write a for loop to construct as many formulas as there are variables at any given time. I begin by running the cross-validation using the largest number of variables (31). The results are saved in a data frame along with the number of variables used. The resulting table has three columns (number of variables, prediction error rate and training error rate) and 31 rows, one for every run.
Finally, I construct a plot showing how both prediction and training errors decrease as the number of variables increases, with prediction errors always being greater. Looking at the resulting plot, I initially found it tough to digest the great variance, and especially the initially increasing trend in prediction errors. But I suspect that might be a result of the varying number of statistically significant variables used in the different models.
The downside of the fairly long code below is the time it takes to execute. I have found that running this last chunk alone - comparing models with up to 31 variables - takes about three minutes. This makes me reluctant to run the code and test small changes, unless I have reduced the number of variables to a handful. If you have any suggestions on how to improve the code and speed up the process, I will gladly appreciate them!
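One likely culprit for the slow runtime: the chunk below calls cv.glm() with K = nrow(alctest), which is leave-one-out cross-validation and refits the model once per observation. Setting K = 10 instead would cut the number of fits dramatically, at the cost of switching to a different resampling scheme. A sketch of the cheaper ten-fold call on hypothetical simulated data:

```r
library(boot)
set.seed(1)
# hypothetical simulated data standing in for alctest
d <- data.frame(y = rbinom(200, 1, 0.3), x = rnorm(200))
m_toy <- glm(y ~ x, data = d, family = "binomial")
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
# ten-fold CV: 10 model fits instead of nrow(d) fits with leave-one-out
err <- cv.glm(data = d, cost = loss_func, glmfit = m_toy, K = 10)$delta[1]
err
```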
library(dplyr)
library(boot)
howmanyvar <- 31 #Enter here how many variables you want to test for. 31 is the maximum. (33 variables were included in the original dataset but two of these will always be excluded for computational feasibility and avoidance of near-perfect correlation.)
#create vector, in sequence, starting from the number above, descending by 1.
v <- rev(seq(1,howmanyvar))
#create empty numeric vectors of same length for the results.
trainingerrors <- integer(howmanyvar)
predictionerrors <- integer(howmanyvar)
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
for(i in v) {
#Within the for loop, I first create a temporary subset called "alctest". I exclude the variables "Dalc", "Walc" and "alc_use", as they are directly related to "high_use" and were found to cause warnings when executing the logistic regression model and computing probabilities. I also exclude "probability" and "prediction", as the probability has been calculated based on the whole dataset and will now be replaced.
exclvarnames <- names(alc) %in% c("Dalc","Walc", "alc_use", "probability", "prediction")
alctest <- alc[!exclvarnames == T]
#From the alctest subset I will gradually drop columns one at a time, starting from the number of variables entered in "howmanyvar". In alctest, 32 has become the index number for the variable high_use. Therefore, I always want to include that one.
alctest <- dplyr::select(alctest, 1:v[i], 32)
#However, I want to exclude "high_use" from the vector with columns names that I want to use on the right-hand side in the formula.
fnames <- names(alctest)[names(alctest) !="high_use"]
# create a formula where "high_use" is explained by the variables
f <- as.formula(paste("high_use~",paste(fnames,collapse="+")))
# run a logistic regression
m2 <- glm(f, data = alctest, family = "binomial")
# predict() the probability of high_use using model m2
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to "alctest"
alctest <- mutate(alctest, probability = probabilities)
# compute the average number of wrong predictions in the (training) data and save the result in vector
trainingerrors[i] <- loss_func(alctest$high_use,alctest$probability)
# leave-one-out cross-validation (K equals the number of observations)
cv <- cv.glm(data = alctest, cost = loss_func, glmfit = m2, K = nrow(alctest))
# compute the average number of wrong predictions in the cross validation and save the result in vector
predictionerrors[i] <- cv$delta[1]
}
results <- data.frame(variables=v, trainingerrors=trainingerrors, predictionerrors=predictionerrors)
results
## variables trainingerrors predictionerrors
## 1 31 0.1918919 0.2567568
## 2 30 0.1891892 0.2621622
## 3 29 0.1945946 0.2540541
## 4 28 0.1972973 0.2540541
## 5 27 0.2162162 0.2783784
## 6 26 0.2108108 0.2621622
## 7 25 0.2162162 0.2702703
## 8 24 0.2162162 0.2729730
## 9 23 0.2675676 0.3162162
## 10 22 0.2594595 0.3270270
## 11 21 0.2729730 0.3216216
## 12 20 0.2675676 0.3297297
## 13 19 0.2702703 0.3297297
## 14 18 0.2783784 0.3270270
## 15 17 0.2810811 0.3162162
## 16 16 0.2783784 0.3162162
## 17 15 0.2810811 0.3081081
## 18 14 0.2810811 0.3081081
## 19 13 0.2729730 0.3135135
## 20 12 0.2783784 0.3162162
## 21 11 0.2729730 0.3324324
## 22 10 0.2918919 0.3189189
## 23 9 0.2864865 0.3108108
## 24 8 0.2918919 0.3162162
## 25 7 0.2945946 0.3054054
## 26 6 0.2945946 0.3054054
## 27 5 0.2918919 0.3027027
## 28 4 0.2972973 0.3135135
## 29 3 0.3054054 0.3135135
## 30 2 0.3000000 0.3000000
## 31 1 0.3000000 0.3000000
p <- ggplot(results, aes(x=variables)) + geom_line(aes(y=predictionerrors, color="prediction")) + geom_line(aes(y=trainingerrors, color="training"))
p + ggtitle("Relation between error rates\n and number of variables") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Number of variables") + ylab("Error rate") + scale_color_discrete(name="Type of error")
date()
## [1] "Mon Dec 13 01:55:18 2021"
The Boston dataset, included in the MASS package, contains 506 observations of 14 variables relating to housing in the Boston region. Each observation describes a Boston suburb or town. Variables include information such as crime rate, air pollution, ethnic composition, proportion of land for large properties or industries, taxation, distances, communications and pricing. As has been stated elsewhere, the dataset has become a much-used pedagogical tool for teaching regression analysis and machine learning. First, let us swiftly explore the contents!
# accessing the MASS package
library(MASS)
# loading the Boston data
data("Boston")
# typing ?Boston will open the documentation on the variables in the R console. A description can also be found [here](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html).
?Boston
# a summary of the content of the variables is given below
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
It is perhaps hard to form an understanding of the relationships between the 14 variables simply by mapping pairs. Drawing a corrplot diagram gives a quick overview of the correlations between variables. The larger the circle, the stronger the correlation; red indicates negative correlation, blue positive.
# plotting the relation between variables with corrplot
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# rounding values
cor_matrix<-cor(Boston) %>% round(2)
# drawing a corrplot
corrplot(cor_matrix, method="circle", type = "upper", tl.cex = 0.6, tl.pos = "d")
# contrary to the instructions provided on Data Camp, I choose to exclude the 'cl.pos = "b"' command, as I prefer to have the color legend vertically on the right rather than horizontally below the figure.
What sticks out from the visualization above is the strong negative correlation between the distance to employment centres (dis) and the proportion of old houses (age), the nitrogen oxides concentration in the air (nox) and the proportion of non-retail businesses (indus). That is, the further away we are from employment centres, the higher the proportion of new houses, the lower the share of industrial properties and the lower the concentration of nitrogen oxides in the air.
Similarly, there is a strong positive correlation between accessibility to highways (rad) and property-tax rate (tax). That is, the better the access to highways, the higher the property tax rate.
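The sign convention behind the colours can be checked with a toy example in plain base R (the vectors here are invented purely for illustration):

```r
# a perfectly increasing and a perfectly decreasing sequence
x <- 1:10
y <- 10:1
# moving in opposite directions gives a correlation of -1 (red in corrplot)
cor(x, y)
# a variable correlates perfectly (+1) with itself (blue in corrplot)
cor(x, x)
```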
In the following section, I will standardize the variables, which allows me to compare them more easily and to perform additional operations on them. (Ideally, I would explore the differences between the standardized and non-standardized variables graphically, for instance using the ggplot bar graph technique described here, but as this was not specifically demanded, I suspect it would be beyond the scope of this exercise.) For now, it will suffice to print out a summary of the variables in the standardized set.
Comparing the summaries, it can be seen that the ranges of some variables have decreased significantly (e.g. crim, zn, indus). In fact, the maximum value of crim (9.92) is the largest value among all standardized variables. As all variables are now centered around a mean of zero, the minimum values of all variables have become negative.
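As a sanity check, scale() should agree with the manual standardization formula (x - mean) / sd; a quick demonstration on a toy vector invented for illustration:

```r
x <- c(2, 4, 6, 8, 10)
# manual standardization: subtract the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)
# scale() returns a matrix with attributes; as.numeric() strips them
scaled <- as.numeric(scale(x))
all.equal(manual, scaled)  # TRUE
```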
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summarize the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
As instructed, I continue by creating a categorical variable for the crime rate, using the quantiles as break points. I choose to name the categories from low to high. I then drop the old crime rate variable from the boston_scaled dataset.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
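On a toy vector (random data, for illustration only), the same quantile/cut combination produces four classes of roughly equal size:

```r
set.seed(1)
x <- rnorm(1000)
# the default quantile() gives five break points: min, quartiles, max
bins <- quantile(x)
cats <- cut(x, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
nlevels(cats)  # 4
table(cats)    # each class holds roughly a quarter of the observations
```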
Next, I divide the dataset into train and test sets so that 80% of the data belongs to the train set, in preparation for fitting a linear discriminant analysis model on the train set.
# dividing the dataset into train and test sets, so that 80% of the data belongs to the train set. I randomly assign 80 per cent of the rows in the Boston dataset to the train set.
n <- nrow(Boston)
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
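The same splitting logic, sketched on a toy data frame (invented for illustration), shows that the two sets are disjoint and together cover all rows:

```r
set.seed(1)
df <- data.frame(x = rnorm(100))
n <- nrow(df)
# draw 80% of the row indices at random for the train set
ind <- sample(n, size = n * 0.8)
train_toy <- df[ind, , drop = FALSE]
test_toy  <- df[-ind, , drop = FALSE]   # the remaining 20%
nrow(train_toy)                         # 80
nrow(train_toy) + nrow(test_toy) == n   # TRUE
```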
Now that I have created the train and test sets, I continue by fitting the linear discriminant analysis on the train set. I choose the categorical crime rate (crime) as the target variable and all the other variables as predictors. I will also plot the LDA fit. The chunk below includes code defining a function for enriching the plot with arrows.
# fitting the linear discriminant analysis on the train set using the categorical crime rate as target variable and all other variables in the dataset as predictor variables.
lda.fit <- lda(crime ~ ., data = train)
# this function can be used for adding arrows to the biplot that will be plotted next
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plotting the lda results with arrows added
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)
The plot drawn above indicates in which direction the various variables pull the results. The longest arrow is clearly given by the rad variable, with zn and nox also pulling the results fairly strongly in different directions.
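The visual impression of arrow lengths can be backed up numerically: the arrows are drawn from coef(lda.fit), so the Euclidean norm of each coefficient row in the LD1/LD2 plane measures how strongly a variable pulls. A sketch on the built-in iris data (not the Boston train set, so the numbers differ):

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
heads <- coef(fit)                        # one row per predictor, one column per LD
lengths <- sqrt(rowSums(heads[, 1:2]^2))  # arrow length in the LD1/LD2 plane
sort(lengths, decreasing = TRUE)          # longest arrow = strongest separator
```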
Continuing to follow the given instructions, I save the crime categories from the test data before removing the crime variable. This allows me to evaluate the correctness of predictions when using the test data to predict crime classifications. The cross tabulation of predictions and correct results is given below.
The results of the cross tabulation are interesting, especially looking at the four corners. None of the areas where the crime rate was predicted low actually had high crime rates, and vice versa. Similarly, where crime rates were predicted high, they actually were high, and where they were predicted low, they chiefly were low. Some variation occurred nevertheless in how accurate the predictions were. For instance, of the areas with low rates, only about half were correctly classified as low-rate areas (most of the rest were classified as med_low, with one instance as med_high). It seems the model did a better job classifying areas with high crime rates than areas with low rates.
# saving crime categories from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predicting classes with the test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulating the results with the crime categories from the test set (removed from the test set)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 10 1 0
## med_low 6 8 11 0
## med_high 1 1 18 0
## high 0 0 0 31
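From the cross tabulation above, the overall share of correct predictions can be computed as the sum of the diagonal over the table total (using the counts printed above; a fresh random split would give slightly different numbers):

```r
# confusion matrix copied from the printout above (rows = correct, cols = predicted)
conf <- matrix(c(15, 10,  1,  0,
                  6,  8, 11,  0,
                  1,  1, 18,  0,
                  0,  0,  0, 31),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 72 correct out of 102, about 0.71
```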
I will now continue with the last step of the exercise proper. I begin by reloading the Boston dataset and standardizing the variables. I will prepare a Euclidean distance matrix to calculate the distances between the observations, and then run a k-means algorithm on the dataset to let the computer sort the data into clusters. The clusters generated by the k-means algorithm will be plotted.
# reloading the Boston dataset
library(MASS)
data("Boston")
# standardizing the dataset
Boston <- scale(Boston)
# preparing a euclidean distance matrix
dist_eu <- dist(Boston)
# running a k-means algorithm on the dataset
km <-kmeans(Boston, centers = 3)
# plotting the results
pairs(Boston, col = km$cluster)
As is seen, the chart above looks really busy. We can determine the optimal number of clusters by plotting the results of a k-means algorithm run with the numbers from 1 to 10. The result is given below.
The optimal number of clusters is indicated by the point where the total within-cluster sum of squares drops sharply. In this case, there is no self-evident answer as to what the optimal number is. In my interpretation, the drop is at its sharpest at two cluster centers, after which the decline slows down. I find it reasonable to cluster using two centers. A new plot is printed below.
# calculating the total within sum of squares (up to 10 clusters)
set.seed(123) # fix the random seed so that the k-means results below are reproducible
twcss <- sapply(1:10, function(k){kmeans(Boston, k)$tot.withinss})
# visualizing the results
library(ggplot2)
qplot(x = 1:10, y = twcss, geom = 'line') + scale_x_continuous(breaks = c(2,4,6,8,10), limits=c(1,10))
# running again the k-means algorithm on the dataset with the newly determined optimal number of clusters
km <-kmeans(Boston, centers = 2)
# plotting the results with pairs
pairs(Boston, col = km$cluster)
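The same elbow logic can be sketched on the built-in iris measurements (used here purely for illustration): with k = 1, the total within-cluster sum of squares equals the total sum of squares, and adding clusters can only bring it down.

```r
d <- scale(iris[, 1:4])
set.seed(123)
# total within-cluster sum of squares for 1 to 10 clusters
twcss_toy <- sapply(1:10, function(k) kmeans(d, centers = k, nstart = 10)$tot.withinss)
# the first value is the total sum of squares; later values are smaller
twcss_toy[1] > twcss_toy[10]  # TRUE
```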
For the bonus section, I will run the k-means algorithm on the original (standardized) Boston data with 3 cluster centers. An LDA is performed with the clusters as target classes. The biplot with arrows is given below. It appears as if zn, nox, medv and tax are the most influential linear separators for the clusters. That is, the proportion of residential land allocated for large properties, air pollution, price level and property tax rate seem to be the variables most strongly influencing which cluster a particular area belongs to.
# reloading the Boston dataset once more
library(MASS)
data("Boston")
# standardizing the dataset
Boston <- as.data.frame(scale(Boston))
# performing k-means on the dataset
km <-kmeans(Boston, centers = 3)
# saving clusters to be used with the LDA
clusters <- km$cluster
# adding the new clusters variable to the set
Boston <- data.frame(Boston, clusters)
# dividing the dataset into train and test sets to be used with the LDA
n <- nrow(Boston)
ind <- sample(n, size = n * 0.8)
train2 <- Boston[ind,]
# removing chas because I repeatedly received the error message "variable 4 appears to be constant within groups"
train2 <- dplyr::select(train2, -chas)
# performing the lda
lda.fit2 <- lda(clusters ~ ., data = train2)
# defining the arrows function
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# plotting the lda results with arrows
plot(lda.fit2, dimen = 2, col=clusters, pch=clusters, ylim=c(-5,7),xlim=c(-5,5))
lda.arrows(lda.fit2, myscale = 4)
For the super bonus section, I apply the given code to create a matrix product, a projection of the data points that will be visualized. I will draw two 3D plots: one where the color is defined by the categorical crime classes, another where the color is defined by the clusters of the k-means. As can be seen here, the shape of the plots is identical. However, the colouring attributed by the categorical crime classes is much ‘neater’, aiding the eye in forming an understanding of which elements belong together. For instance, all yellow values are gathered at the left-hand side of the chart (with high x values). Turning and twisting the graphic with the mouse helps in understanding the data even better.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
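The dimensions line up as in any matrix product: a 404 x 13 predictor matrix times the 13 x 3 scaling matrix yields a 404 x 3 projection, one coordinate per linear discriminant. A toy check of the rule (matrices invented for illustration):

```r
A <- matrix(1:12, nrow = 4, ncol = 3)  # 4 x 3
B <- matrix(1:6,  nrow = 3, ncol = 2)  # 3 x 2
# inner dimensions (3) must match; the result takes the outer ones
dim(A %*% B)  # 4 2
```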
# accessing the plotly package in order to create a 3D plot of the columns of the matrix product. (I have run the command install.packages("plotly") once already.)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~classes)
# extracting the clusters from the second train set
cluscol <- train2$clusters
# drawing the second plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~cluscol)
date()
## [1] "Mon Dec 13 01:55:31 2021"
The human dataset contains country-specific indicators relating to the Human Development Index (HDI), provided by the United Nations Development Programme. The index measures achievements in dimensions of human development such as health and longevity, education and standard of living.
Variables included in this exercise indicate the ratio of women to men in secondary education and labour force, as well as expected length of education and life. Variables also include Gross National Income as a measure of standard of living, and maternal mortality and adolescent birth rate as measures of the level of health care provided to young women.
# reading the human data from my data folder
human <- read.csv("./data/human.csv", stringsAsFactors = F, row.names = 1)
# exploring the data structure
head(human)
## EduRatio LabRatio EducExp LifeExp GNI Matmort AdolBirthRt
## Norway 1.0072389 0.8908297 17.5 81.6 64992 4 7.8
## Australia 0.9968288 0.8189415 20.2 82.4 42261 6 12.1
## Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6 1.9
## Denmark 0.9886128 0.8840361 18.7 80.2 44025 5 5.1
## Netherlands 0.9690608 0.8286119 17.9 81.6 45435 6 6.2
## Germany 0.9927835 0.8072289 16.5 80.9 43919 7 3.8
## ParlRep
## Norway 39.6
## Australia 30.5
## Switzerland 28.5
## Denmark 38.0
## Netherlands 36.9
## Germany 36.9
summary(human)
## EduRatio LabRatio EducExp LifeExp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Matmort AdolBirthRt ParlRep
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# plotting pairs
library(GGally)
ggpairs(human)
Below is a corrplot diagram of the correlations between the variables. Based on the data, it is interesting to note that parliament representation does not seem to be particularly strongly correlated with anything. At least from the corrplot diagram, no strong correlation can be discerned (the share of females in parliament is moderately correlated with the labour force ratio, expected length of education and life expectancy). Neither does the labour force ratio seem to be strongly correlated with anything.
# drawing a corrplot
library(corrplot)
cor(human) %>% corrplot()
First, I will perform a principal component analysis (PCA) on the non-standardized human data, as instructed in the exercise. The variability captured by the principal components is given in the printout of values below. The plot highlights the impact of the Gross National Income (GNI) variable as this was the one variable containing the largest absolute numbers (maximum values over 100 times larger than in any other variable).
# perform principal component analysis (with the SVD method) on the human dataset in its *non-standardized* form
pca_human <- prcomp(human)
# exploring the variability captured by the principal components
summary(pca_human)$importance[2,]
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.9999 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.9), col = c("grey40", "deeppink2"), main = "Impact of GNI on HDI")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
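The dominance of GNI is a general property of unstandardized PCA: the component directions chase raw variance, so a variable on a much larger scale swallows the first component. A toy demonstration (simulated data, for illustration only):

```r
set.seed(1)
small <- rnorm(100)           # variance around 1
big   <- rnorm(100) * 1000    # variance around one million
pca_toy <- prcomp(cbind(small, big))
# the large-scale variable captures essentially all of the variance on PC1
summary(pca_toy)$importance[2, 1] > 0.99  # TRUE
```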
Now, I will redo the PCA analysis with standardized variables.
# standardize variables
human_std <- scale(human)
# perform principal component analysis (with the SVD method) on the human dataset in its *standardized* form
pca_human_std <- prcomp(human_std)
# exploring the variability captured by the principal components
s <- summary(pca_human_std)$importance[2,]
s
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.53605 0.16237 0.09571 0.07583 0.05477 0.03595 0.02634 0.01298
# save rounded percentages of variance captured by each PC (to be used as axis labels)
pca_pr <- round(s*100, digits = 1)
# create object pc_lab (to be used as axis labels)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 0.9), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "Interconnected variables impacting HDI")
As was seen above, the first plot, drawn from the non-standardized version, was extremely hard to read, as almost all results in the scatter plot were pushed into the upper right corner in a really crowded space. Because of the high absolute values reported in the GNI variable, that is about the only arrow whose direction is visible in the first plot; I cannot even tell if there are any arrows pointing in any other direction (I suppose the arrows of ParlRep and LabRatio are pointing upwards, but they are at least not properly visible).
The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable, and having a decent standard of living, according to the UNDP.
From the latter plot we can discern three groups of variables that seem to influence the HDI index value in different ways. Life expectancy, maternal mortality and adolescent birth rate are all related to living a long and healthy life. Life expectancy is negatively correlated with maternal mortality and adolescent birth rate, and so these arrows pull the results in opposite directions.
The proportion of women in labour force and parliament (visible as arrows pointing upwards) can be interpreted as variables measuring women’s possibilities to participate in and impact society. The proportion of women in parliament is moderately correlated with the proportion of women in the labour force.
Interestingly, the ratio of women to men in secondary education is more strongly related to life expectancy than to the proportion of women in parliament or the ratio of women to men in the labour force. Being knowledgeable seems to be more strongly correlated with being able to live a long and healthy life than the ability to participate in working life and national politics.
The scatter plot presented through the PCA analysis is a very interesting one. The GNI variable that stood out in the first plot is now almost entirely hidden behind the EduRatio variable (and is thus very strongly correlated with that variable). In a way, this is the key to understanding the horizontal x axis of the plot. On the left-hand side, we find rich countries: European, Asian and Arab alike. On the right-hand side, we find poorer countries, many of which have been hit by war and poverty and lack good public health services.
The vertical y axis is also an interesting one, although initially harder to label. I speculated over whether it was a conservative-liberal axis or a progressive-reactionary axis, before concluding that the y axis is all about gender equality. Let me explain why.
The angle between a variable and a PC axis can be interpreted as the correlation between the two. In the case of the x axis, the difference in angle is minimal for the variables at each horizontal end. As the angles between the ParlRep and LabRatio variables and the y axis are almost as small (although noticeably larger), one could read the y axis as an axis reflecting equality between women and men (in politics and the workforce), placing Rwanda at the top and hard-line Arab states at the bottom. (Rwanda’s top placement might come as a surprise to some, but the power balance between men and women in Rwanda, and the attitudes and approaches chosen by Rwandan women, have been elaborated on in media articles, for example here.)
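The angle-as-correlation rule can be verified numerically: after centering, the cosine of the angle between two variable vectors equals their Pearson correlation. A toy check on simulated vectors (invented for illustration):

```r
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30)
# center both vectors, then compute the cosine of the angle between them
cx <- x - mean(x)
cy <- y - mean(y)
cosine <- sum(cx * cy) / sqrt(sum(cx^2) * sum(cy^2))
all.equal(cosine, cor(x, y))  # TRUE
```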
In principal component analysis, the first principal component captures the maximum amount of variance from the features of the original data. In this instance, PC1 (relating to life expectancy and health issues) explains just over half of the variance, while the gender equality aspects relating to female representation in the labour force and national politics account for only about a sixth of the variance between countries. (Note how one single variable, GNI, explained 99 per cent of the variation in the erroneous first plot.)
A note on the side: what the plots, and the data, have not said anything about is the distribution of income and wealth within countries. Attempts to quantify inequality have at various times been made using, for instance, the Gini coefficient, although this measure has shortcomings of its own (it does not, for instance, take into account the effect of income redistribution, differences in living expenses between countries, or the distribution of wealth).
The tea dataset of the FactoMineR package includes 300 observations of 36 variables relating to tea drinking habits. Most variables are factors with 2 levels (i.e. binary), but some have more levels. Initially, I will visualise the data by drawing bar charts of six of the most interesting variables. I will then conduct a Multiple Correspondence Analysis on the tea data, offer an interpretation of its results and draw a variable biplot.
# loading the tea dataset
library(FactoMineR) # I installed FactoMineR before writing this code
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
data("tea")
# exploring the tea dataset
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
?tea
dim(tea)
## [1] 300 36
# moving on to visualization. I choose to visualize six of the variables that I find most interesting.
keep_columns <- c("age_Q","frequency","How","price","SPC","Tea")
tea_visu <- select(tea, one_of(keep_columns))
gather(tea_visu) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# conducting the Multiple Correspondence Analysis. I continue with my selection of six interesting variables.
mca <- MCA(tea_visu, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_visu, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.350 0.292 0.234 0.214 0.206 0.198 0.189
## % of var. 9.127 7.610 6.092 5.588 5.386 5.172 4.926
## Cumulative % of var. 9.127 16.737 22.830 28.417 33.803 38.975 43.901
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.181 0.180 0.178 0.170 0.166 0.152 0.149
## % of var. 4.730 4.705 4.636 4.431 4.334 3.977 3.897
## Cumulative % of var. 48.631 53.336 57.972 62.403 66.737 70.714 74.611
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.145 0.140 0.134 0.123 0.113 0.109 0.108
## % of var. 3.789 3.663 3.484 3.222 2.960 2.849 2.813
## Cumulative % of var. 78.400 82.064 85.547 88.769 91.729 94.577 97.390
## Dim.22 Dim.23
## Variance 0.055 0.046
## % of var. 1.423 1.188
## Cumulative % of var. 98.812 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.239 0.054 0.008 | 1.146 1.501 0.184 | 0.579 0.478
## 2 | 0.461 0.202 0.060 | 0.798 0.727 0.181 | 0.313 0.140
## 3 | 0.025 0.001 0.000 | 0.459 0.241 0.057 | 0.211 0.063
## 4 | -0.847 0.684 0.411 | -0.364 0.151 0.076 | 0.155 0.034
## 5 | -0.125 0.015 0.008 | 0.193 0.042 0.018 | -0.049 0.003
## 6 | -0.973 0.901 0.257 | -0.391 0.175 0.042 | 0.122 0.021
## 7 | 0.055 0.003 0.001 | 0.531 0.322 0.069 | 0.818 0.955
## 8 | 0.401 0.153 0.035 | 0.702 0.563 0.108 | 1.012 1.462
## 9 | 0.454 0.196 0.051 | 0.693 0.549 0.118 | 0.767 0.839
## 10 | 0.680 0.441 0.117 | 0.520 0.309 0.069 | 0.911 1.185
## cos2
## 1 0.047 |
## 2 0.028 |
## 3 0.012 |
## 4 0.014 |
## 5 0.001 |
## 6 0.004 |
## 7 0.163 |
## 8 0.225 |
## 9 0.144 |
## 10 0.210 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## 15-24 | -1.028 15.424 0.467 -11.817 | -0.722 9.133 0.231 -8.303
## 25-34 | -0.194 0.412 0.011 -1.834 | 0.556 4.059 0.092 5.252
## 35-44 | 0.438 1.220 0.030 2.972 | 0.967 7.123 0.144 6.558
## 45-59 | 0.410 1.627 0.043 3.580 | 0.732 6.227 0.137 6.396
## +60 | 1.721 17.867 0.429 11.332 | -1.454 15.306 0.307 -9.577
## 1/day | -0.080 0.098 0.003 -0.947 | 0.450 3.670 0.094 5.302
## 1 to 2/week | -0.320 0.717 0.018 -2.297 | -0.096 0.077 0.002 -0.686
## +2/day | 0.149 0.446 0.016 2.204 | -0.262 1.658 0.050 -3.879
## 3 to 6/week | 0.084 0.038 0.001 0.519 | -0.157 0.159 0.003 -0.969
## alone | -0.217 1.454 0.087 -5.107 | -0.080 0.237 0.012 -1.882
## Dim.3 ctr cos2 v.test
## 15-24 | 0.285 1.775 0.036 3.275 |
## 25-34 | -0.740 8.994 0.164 -6.995 |
## 35-44 | 0.865 7.127 0.115 5.870 |
## 45-59 | 0.058 0.048 0.001 0.502 |
## +60 | -0.349 1.099 0.018 -2.296 |
## 1/day | -0.644 9.372 0.192 -7.580 |
## 1 to 2/week | 0.576 3.468 0.057 4.127 |
## +2/day | 0.175 0.926 0.023 2.594 |
## 3 to 6/week | 0.400 1.297 0.020 2.475 |
## alone | -0.012 0.006 0.000 -0.271 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## age_Q | 0.767 0.732 0.267 |
## frequency | 0.027 0.097 0.211 |
## How | 0.156 0.074 0.139 |
## price | 0.162 0.097 0.207 |
## SPC | 0.677 0.745 0.327 |
## Tea | 0.310 0.005 0.250 |
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
As the distance between variable categories in an MCA provides a measure of their similarity, the map gives some hints as to how to interpret the tea drinking habits of different population groups. The pink colour indicates occupation, black indicates age, blue indicates tea brand, red indicates tea drinking frequency, brown indicates kind of tea (green or black or other) and green indicates ways of drinking tea (e.g. with milk or lemon).
Stereotyping heavily, one could suggest that in the bottom-right corner we find non-workers and seniors who might have an occasional cup of some cheap label every now and then. In the bottom-left corner, we find young students drinking private-label teas, some of them fairly often, even twice a day. In the upper-left corner, we find workers and employees having an occasional cup once a week or a regular cup once a day of some unknown or variable brand. In the upper-right corner, we find the more posh middle- and upper-class drinkers who do not necessarily drink tea that regularly (perhaps they prefer coffee) but who are presumably picky about their drink (the branded and upscale tea assortments are found here). Also, the closer we get to the middle and senior categories, the more green tea is consumed.
It’s worth noting that the two dimensions plotted in the MCA factor map account for only a small share of the variation - in fact, less than ten per cent each. In other words, there are other dimensions that jointly explain much more of the variation in tea drinking habits.
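To make this concrete, the cumulative share of variance can be checked directly from the eigenvalue percentages printed in the MCA summary above (values copied from the printout; in a full analysis one would read them from `mca$eig` instead):

```r
# Percentages of variance per dimension, copied from the MCA summary above
pct <- c(9.127, 7.610, 6.092, 5.588, 5.386, 5.172, 4.926)
# The first two dimensions together explain only about 16.7 % of the variance
cumsum(pct)[2]
```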
date()
## [1] "Mon Dec 13 01:55:42 2021"
In Exercise 6 we were asked to conduct analyses on the RATS and BPRS datasets following the techniques described in the course book, Vehkalahti and Everitt’s (2019) Multivariate Analysis for the Behavioral Sciences. The trick was to swap datasets, implementing the analyses of Chapter 8 using the RATS data and Chapter 9 using the BPRS data.
The RATS dataset contains data from a nutrition study conducted on three groups of rats (Crowder and Hand, 1990). Vehkalahti and Everitt (2019: 174) summarize the study design: “the three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.”
I begin by reading the data, converting the categorical variables to factors, and confirming the structure. (I considered factorizing the time variable but refrained from doing so, as time is essentially continuous by nature.) Let’s also take a look at what the data currently looks like.
# Reading the data. (I have already transformed the data into long form using the wrangling script.)
RATSL <- read.csv("./data/ratsl.csv", stringsAsFactors = T)
# Converting categorical variables to factors
RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)
# Exploring the structure of the dataset
str(RATSL)
## 'data.frame': 176 obs. of 4 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
# Viewing the data
head(RATSL)
## ID Group Time Weight
## 1 1 1 1 240
## 2 2 1 1 225
## 3 3 1 1 245
## 4 4 1 1 260
## 5 5 1 1 255
## 6 6 1 1 260
On the DataCamp course site, the authors note that graphical displays of data are almost always useful for exposing patterns in the data, particularly when these are unexpected. Let’s begin by plotting the weights of all sixteen rats, differentiating between the groups into which they were divided.
library(dplyr); library(tidyr); library(ggplot2)
# Plotting rats
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
The graphical overview makes it clear that group 1 consists of small rats (weight < 300 g). Groups 2 and 3 seem to consist chiefly of mid-size and large rats (roughly 400-500 g and 500-600 g, respectively). However, the division is not clear-cut (some overlap is visible). It is not entirely clear to me on what basis the rats were grouped.
The graphical presentation above has the advantage of a crystal-clear grouping and the possibility of following individual progress, at least in groups 2 and 3. However, one might get the impression that the facets waste visual space. An alternative presentation is given in Vehkalahti and Everitt (2019: 177), offering a quick overview of the rats’ development; its downside is that groups 2 and 3 are hard to tell apart, as the linetypes are so similar. Below, I present all groups in the same panel and differentiate between them using colours. After this I will continue with the faceted presentation above.
ggplot(RATSL, aes(x = Time, y = Weight, group = interaction(Group, ID), colour = Group, linetype = Group)) +
geom_line() +
theme(legend.position = "top") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight))) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20))
Let’s explore the data further by doing a few exercises, standardising the Weight variable and plotting again with the standardised variable.
# Standardising the Weight variable and adding a column
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdWeight = ((Weight-mean(Weight))/sd(Weight))) %>%
ungroup()
# Glimpsing the data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
# Plotting again with the standardised Weight variable
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "standardized Weight", limits = c(min(RATSL$stdWeight), max(RATSL$stdWeight)))
The graph drawn with standardized weights shows quite clearly how the rats’ weights are placed in relation to the mean, and also the relative development of individuals and groups. In group 1, the development seems to have been fairly even; the change is hardly visible. In group 2, individuals move in opposite directions. In group 3, weights seem to be roughly steady or declining.
Caution should be taken in drawing far-reaching conclusions from this data, especially considering the small number of individuals included in groups 2 and 3.
Let’s move on to summary graphs.
# Summary data with mean and standard error of Weight by Group and Time
# (note: n in the standard error is the total number of rats, not the group size)
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = (sd(Weight)/sqrt(length(unique(RATSL$ID))))) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Glimpsing the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se <dbl> 3.805394, 3.273268, 2.868977, 3.400204, 2.764370, 2.945942, 2.73…
# Plotting the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.4)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
The graph above shows that the standard error in groups 1 and 3 is very small, and only slightly bigger in group 2. It is also roughly discernible that weight rises faster in group 2 than in groups 1 and 3.
We can also create boxplots in order to observe outliers and identify thresholds for filtering them out.
# Create a summary data by Group and ID with mean as the summary variable (ignoring baseline Time 1).
RATSL8S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Glimpsing the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# Drawing a boxplot of mean versus Group
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), 64 days")
If we are to follow the recipe of the DataCamp course and the MABS book precisely, my next step would be to filter out the outlier values and plot the data again. I can do so by cutting off values below 250 grams and above 550 grams. However, getting rid of the outlier in group 3 is trickier - I have to filter out rats under 500 grams in that group only.
Again, I want to stress the small sample size, since this poses considerable problems for drawing conclusions. I would even be willing to pose some questions relating to the design of the original study. Should, for instance, groups 2 and 3 have been combined, since the rats in these groups are considerably larger than those in group 1, and since the two groups taken together would be as large as group 1? Or should the groups have been mixed, so that equal numbers of small and large rats would have been given the same diet?
# Filtering out outliers (rats weighing more than 550 or less than 250 grams)
RATSL8S1 <- RATSL8S %>% filter(RATSL8S$mean<550 & RATSL8S$mean > 250)
# Filtering out the last outlier (in group 3, weighing less than 500 grams)
RATSL8S1 <- RATSL8S1[-which(RATSL8S1$Group == 3 & RATSL8S1$mean < 500),]
# Inspecting the summary reveals that one outlier has been eliminated in every group
summary(RATSL8S1)
## Group ID mean
## 1:7 1 :1 Min. :261.7
## 2:3 3 :1 1st Qu.:267.5
## 3:3 4 :1 Median :276.2
## 5 :1 Mean :373.3
## 6 :1 3rd Qu.:457.5
## 7 :1 Max. :542.2
## (Other):7
# Plotting the data again without outliers
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), 64 days")
# Inspecting summaries by group
RATSL8S1 %>%
group_by(Group) %>%
summarize(mean = mean(mean))
## # A tibble: 3 × 2
## Group mean
## <fct> <dbl>
## 1 1 269.
## 2 2 452.
## 3 3 538.
The plot drawn without outliers largely confirms what we already knew: group 1 consists of small rats, group 2 of mid-size and group 3 of large rats, weighing 269, 452 and 538 grams on average. The variance within groups is very small, partly due to the small sample size and the subsequent filtering of outliers.
Let’s do some calculations. Since I have three groups, I will not perform simple pairwise t-tests between the groups, since this could lead to an accumulation of type I errors. Instead, I’ll rely on ANOVA for testing.
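As a back-of-the-envelope illustration of why multiple t-tests are problematic, the family-wise error rate grows with the number of comparisons (a base R sketch; with three groups there are three pairwise comparisons):

```r
alpha <- 0.05   # nominal significance level per test
k <- 3          # number of pairwise comparisons between three groups
# Probability of at least one false positive across k independent tests
1 - (1 - alpha)^k   # ~0.14, well above the nominal 0.05
# pairwise.t.test() with p.adjust.method = "bonferroni" would correct for this
```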
The analysis confirms that the mean Weight is strongly related to the baseline Weight recorded on day 1 (p < 0.001). After adjusting for baseline, there is only weak evidence of a difference between the diet groups (p = 0.076). The regression summary shows that neither Group 2 (p = 0.089) nor Group 3 (p = 0.329) differs significantly from Group 1 at the 5 per cent level.
# Adding the baseline. According to the instructions on the DataCamp course site,
# we should be using the original data as the source for the new variable.
# Since I never imported the original data, I now have to extract the baseline from
# the imported data. (I did confirm that the extracted vector corresponds to that from the original data.)
RATSL8S2 <- RATSL8S %>%
mutate(baseline = RATSL[RATSL$Time == 1,]$Weight)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
# A command used to prompt the use of zeros instead of scientific notation for p values
options(scipen=999)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 0.0000000000000157 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Print a summary of the regression model for comparison.
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.905 -4.194 2.190 7.577 14.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.16375 21.87657 1.516 0.1554
## baseline 0.92513 0.08572 10.793 0.000000156 ***
## Group2 34.85753 18.82308 1.852 0.0888 .
## Group3 23.67526 23.25324 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared: 0.9936, Adjusted R-squared: 0.992
## F-statistic: 622.1 on 3 and 12 DF, p-value: 0.0000000000001989
The confidence intervals concern the model coefficients, not individual rats. With 95 per cent confidence, the Group 2 coefficient lies between a loss of about 6 grams and a gain of about 76 grams relative to Group 1 (at the same baseline weight), and the Group 3 coefficient between about -27 and +74 grams. Both intervals include zero, in line with the non-significant coefficients above.
# The confidence intervals (at 95 %) are given below.
confint(fit)
## 2.5 % 97.5 %
## (Intercept) -14.5011937 80.828690
## baseline 0.7383656 1.111899
## Group2 -6.1544447 75.869498
## Group3 -26.9892106 74.339724
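As a check, the interval for the Group2 coefficient can be reproduced by hand from the regression summary above (estimate 34.85753, standard error 18.82308, 12 residual degrees of freedom):

```r
est <- 34.85753; se <- 18.82308; df <- 12
# 95 % confidence interval: estimate +/- t-quantile times standard error
est + c(-1, 1) * qt(0.975, df) * se   # ~ -6.15 to 75.87, matching confint(fit)
```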
In the second part of the exercise we will be exploring the BPRS dataset using the techniques employed in Chapter 9 of the course book. Vehkalahti and Everitt (2019: 157) summarize the psychiatric study that provided the data for the dataset: “40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.”
Again, we will begin by reading the data, converting categorical variables to factors, exploring the structure of the data and plotting the data in order to explore patterns.
# Reading the data
BPRSL <- read.csv("./data/bprsl.csv", stringsAsFactors = T)
# Converting categorical variables to factors
BPRSL$treatment <- as.factor(BPRSL$treatment)
BPRSL$subject <- as.factor(BPRSL$subject)
# Confirming the structure of the dataset
str(BPRSL)
## 'data.frame': 360 obs. of 4 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
# Viewing the data
head(BPRSL)
## treatment subject week bprs
## 1 1 1 0 42
## 2 1 2 0 58
## 3 1 3 0 54
## 4 1 4 0 55
## 5 1 5 0 72
## 6 1 6 0 48
# Plotting the BPRSL data
ggplot(BPRSL, aes(x = week, y = bprs, group = interaction(treatment, subject), colour = treatment)) +
geom_line() +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "BPRS score") +
theme(legend.position = c(0.85,0.85)) +
ggtitle("Impact of treatment on BPRS") +
theme(plot.title = element_text(hjust = 0.5))
From the plot above, it appears as if both treatments have a positive impact, generally lowering BPRS scores. However, simply by casting an eye on the chart it is hard to tell which treatment does a better job.
As a means of initial exploration, we can fit a linear regression model to explore the difference between the two treatments.
# creating a regression model
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRSL)
# printing out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <0.0000000000000002 ***
## week -2.2704 0.2524 -8.995 <0.0000000000000002 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 0.00000000000000022
From the summary of the regression model it can be seen that time (the variable week) has a clearly significant effect: the more weeks pass, the lower the bprs score. However, we find no difference between the treatments, at least not with this regression model, which ignores the repeated-measures structure of the data.
Let’s move on to fit a random intercept model, which allows the linear regression fit for each individual to differ in intercept from those of the others. We use the same explanatory variables: time (week) and treatment.
We will use the lme4 package, as instructed. The formula adheres to the basic linear regression formula standards, with the addition of random-effects terms distinguished by vertical bars (|).
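As a sketch of that syntax, the three most common random-effects specifications look like this (these are ordinary formula objects; lme4 interprets the parenthesised terms, including the `||` shorthand for uncorrelated effects):

```r
# Random intercept per subject
f1 <- bprs ~ week + treatment + (1 | subject)
# Random intercept and random slope for week, correlated
f2 <- bprs ~ week + treatment + (week | subject)
# Random intercept and slope with the correlation constrained to zero
f3 <- bprs ~ week + treatment + (week || subject)
```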
# accessing the lme4 library
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# creating a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Printing the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
The printout contains a number of interesting figures. Perhaps the most noteworthy is the variability of the subjects’ intercepts, expressed by the standard deviation of 6.885 points.
In the following, we fit a random intercept and random slope model to the bprs data. This allows the linear regression fits for each individual to differ in both intercept and slope, making it possible to account for individual differences in the subjects’ development over time.
# creating a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# printing a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the printout of the random intercept and random slope model, the subjects’ intercept standard deviation has grown to over 8. The ANOVA comparison of the two models suggests that the random intercept and random slope model (BPRS_ref1) fits the data better than the random intercept model (p = 0.026). Thus, the individual slopes seem to matter.
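The chi-square statistic in the ANOVA table is simply the difference between the two models’ deviances, with the difference in parameter counts as the degrees of freedom (values copied from the printouts above):

```r
dev_ref  <- 2738.7   # deviance of the random intercept model (5 parameters)
dev_ref1 <- 2731.4   # deviance of the random intercept and slope model (7 parameters)
chisq <- dev_ref - dev_ref1                  # 7.3, with Df = 7 - 5 = 2
pchisq(chisq, df = 2, lower.tail = FALSE)    # ~0.026, matching Pr(>Chisq) above
```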
We can now fit a random intercept and slope model that allows for interaction between time and treatment.
# creating a random intercept and random slope model with an interaction term
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + week*treatment, data = BPRSL, REML = FALSE)
# printing out a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + week * treatment
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# performing an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + week * treatment
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the summary, it does not appear that the interaction term has a significant effect on the model (t value 1.785); neither does treatment (t value -1.2). The ANOVA comparison likewise gives only weak evidence for the interaction model (p = 0.075).
As a final step, we can replot the original observed values and study how they compare with the fitted values of the model. The trend visible in the observed values is even more clearly visible in the fitted values. I would conclude that time spent in care clearly has an effect on the brief psychiatric rating scale scores measured among the 40 men here. Whether the decline is due to the treatment is unclear: from the data analysed here, it cannot be concluded that one treatment is better than the other, or even that the treatment is effective in the first place. To advance such a claim, we would have needed a control group receiving no treatment at all.
# Creating a vector of the fitted values
fitted <- fitted(BPRS_ref2)
# Create a new column fitted to BPRSL
BPRSL <- mutate(BPRSL, fitted=fitted)
# Reprinting the plot with observed BPRS values (for the sake of comparison)
ggplot(BPRSL, aes(x = week, y = bprs, group = interaction(treatment, subject), colour = treatment)) +
geom_line() +
scale_x_continuous(name = "Time (weeks)") +
scale_y_continuous(name = "BPRS score") +
theme(legend.position = c(0.85,0.85)) +
ggtitle("Observed BPRS values") +
theme(plot.title = element_text(hjust = 0.5))
# plotting the fitted values
ggplot(BPRSL, aes(x = week, y = fitted, group = interaction(treatment, subject), colour = treatment)) +
geom_line() +
scale_x_continuous(name = "Time (weeks)") +
scale_y_continuous(name = "BPRS score") +
theme(legend.position = c(0.85,0.85)) +
ggtitle("Fitted BPRS values") +
theme(plot.title = element_text(hjust = 0.5))